Joining Data in R

Using dplyr to join data tables

Objectives

This document outlines some common data joining tasks with the dplyr package. If you’ve ever Googled ‘how to perform [x] in R’, you’ve probably discovered there are multiple ways to accomplish [x]. The methods I present here are not exhaustive, and I’ve tried to balance the trade-off between 1) solutions/functions that solve problems with the fewest number of keystrokes, and 2) code that’s readable, explicit, and easy to follow.

Load packages

dplyr is part of the core tidyverse packages, so we install and load this meta-package below.

Namespacing

I’ll be using the namespace syntax (package::function()) just to make it explicitly clear which function belongs to which package, but this isn’t necessary if you’ve run the code block above.

Inspecting data tables

Whenever I’m performing a series of joins, I like to use the inspectdf package. This package has a series of functions for viewing all the columns in a table by various types:

There are three principles for tidy data:

  1. Columns hold variables
  2. Rows hold observations
  3. Cells (the intersection of rows and columns) hold values

Joins

Joins give us the ability to combine multiple datasets on a common column. We’re going to be using the starwarsdb package, which dplyr::contains data from the Star Wars API.

A Data Model

The code below creates and displays the data model for the tables in this package

sw_data_model <- dm::dm_draw(dm = starwarsdb::starwars_dm(),
  graph_name = "sw_data_model",
  view_type = "all")
sw_data_model

As you can see from the graph above, there are nine tables in the starwarsdb. The tables connect to each other through a series of common columns (called keys), and these allow us to perform multiple kinds of joins.

Create sw_films

Load the starwarsdb::films dataset, removing the opening_crawl column, and filter to only episodes 4-6. Store these data in sw_flms

sw_flms <- starwarsdb::films |> 
  dplyr::select(-c(opening_crawl, director, producer)) |> 
  dplyr::filter(episode_id >= 4 & episode_id <= 6)
sw_flms
# A tibble: 3 × 3
  title                   episode_id release_date
  <chr>                        <int> <date>      
1 A New Hope                       4 1977-05-25  
2 The Empire Strikes Back          5 1980-05-17  
3 Return of the Jedi               6 1983-05-25  

left_join

A left_join keeps all of x, and joins it to all matching rows from dataset y

We want to left-join sw_flms to starwarsdb::films_vehicles:

starwarsdb::films_vehicles
# A tibble: 104 × 2
   title                   vehicle                     
   <chr>                   <chr>                       
 1 A New Hope              CR90 corvette               
 2 A New Hope              Star Destroyer              
 3 A New Hope              Sentinel-class landing craft
 4 A New Hope              Death Star                  
 5 A New Hope              Millennium Falcon           
 6 A New Hope              Y-wing                      
 7 A New Hope              X-wing                      
 8 A New Hope              TIE Advanced x1             
 9 The Empire Strikes Back Star Destroyer              
10 The Empire Strikes Back Millennium Falcon           
# … with 94 more rows

We can do this on "title", then we can re-organize the columns using dplyr::select() to place the vehicle column before dplyr::everything() else.

dplyr::left_join(x = sw_flms, 
          y = starwarsdb::films_vehicles, 
          by = "title")  |>   
  dplyr::select(vehicle, 
    dplyr::everything()) 
# A tibble: 47 × 4
   vehicle                      title      episode_id release_date
   <chr>                        <chr>           <int> <date>      
 1 CR90 corvette                A New Hope          4 1977-05-25  
 2 Star Destroyer               A New Hope          4 1977-05-25  
 3 Sentinel-class landing craft A New Hope          4 1977-05-25  
 4 Death Star                   A New Hope          4 1977-05-25  
 5 Millennium Falcon            A New Hope          4 1977-05-25  
 6 Y-wing                       A New Hope          4 1977-05-25  
 7 X-wing                       A New Hope          4 1977-05-25  
 8 TIE Advanced x1              A New Hope          4 1977-05-25  
 9 Sand Crawler                 A New Hope          4 1977-05-25  
10 T-16 skyhopper               A New Hope          4 1977-05-25  
# … with 37 more rows

Assign to sw_flms_veh

dplyr::left_join(x = sw_flms, 
          y = starwarsdb::films_vehicles, 
          by = "title")  |>   
  dplyr::select(vehicle, 
    dplyr::everything()) -> sw_flms_veh

Let’s take a look at the sw_flms_veh table:

sw_flms_veh |> dplyr::glimpse()
Rows: 47
Columns: 4
$ vehicle      <chr> "CR90 corvette", "Star Destroyer", "Sentinel-cl…
$ title        <chr> "A New Hope", "A New Hope", "A New Hope", "A Ne…
$ episode_id   <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5,…
$ release_date <date> 1977-05-25, 1977-05-25, 1977-05-25, 1977-05-25…

We can see this table has 4 columns and 47 rows. We’ll use the inspect_types() function below to view the breakdown of the columns by type:

sw_flms_veh |> 
  inspectdf::inspect_types() |> 
  inspectdf::show_plot(
    text_labels = TRUE, col_palette = 1)

We can see most of these columns are character. Below we’ll display the categorical variables using inspectdf::inspect_cat():

sw_flms_veh |> 
  inspectdf::inspect_cat() |> 
  inspectdf::show_plot(
    text_labels = TRUE, col_palette = 1)

sw_flms_veh |> 
  inspectdf::inspect_num() |> 
  inspectdf::show_plot(
    text_labels = TRUE, col_palette = 1) 

inner_join

The inner_join() joins dataset x and y, and keeps only matching rows from both.

The code below uses an inner_join to join sw_films_veh to starwarsdb::pilots on "vehicle", and assigns it to sw_flms_veh_plt

dplyr::inner_join(x = sw_flms_veh, 
           y = starwarsdb::pilots, 
           by = "vehicle") -> sw_flms_veh_plt
sw_flms_veh_plt |> dplyr::glimpse()
Rows: 39
Columns: 5
$ vehicle      <chr> "Millennium Falcon", "Millennium Falcon", "Mill…
$ title        <chr> "A New Hope", "A New Hope", "A New Hope", "A Ne…
$ episode_id   <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5,…
$ release_date <date> 1977-05-25, 1977-05-25, 1977-05-25, 1977-05-25…
$ pilot        <chr> "Chewbacca", "Han Solo", "Lando Calrissian", "N…
sw_flms_veh_plt |> 
  inspectdf::inspect_types() |> 
  inspectdf::show_plot(
    text_labels = TRUE, col_palette = 1)

sw_flms_veh_plt |> 
  inspectdf::inspect_cat() |> 
  inspectdf::show_plot(
    text_labels = TRUE, col_palette = 1)

right_join

The right_join() includes all the rows in dataset y, and joins it the matching rows in dataset x

The code below uses a right_join() to join sw_flms_veh_plt to starwarsdb::films_planets on "title", keeping all the rows in starwarsdb::films_planets.

Then we use select() to rename planet to flm_planet and assign this to sw_flms_veh_plt_plnt

dplyr::right_join(
  x = sw_flms_veh_plt,
  y = starwarsdb::films_planets,
  by = "title") %>%
  dplyr::select(
    vehicle:release_date,
    film_planet = planet,
    pilot) -> sw_flms_veh_plt_plnt
sw_flms_veh_plt_plnt |> dplyr::glimpse()
Rows: 183
Columns: 6
$ vehicle      <chr> "Millennium Falcon", "Millennium Falcon", "Mill…
$ title        <chr> "A New Hope", "A New Hope", "A New Hope", "A Ne…
$ episode_id   <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
$ release_date <date> 1977-05-25, 1977-05-25, 1977-05-25, 1977-05-25…
$ film_planet  <chr> "Tatooine", "Alderaan", "Yavin IV", "Tatooine",…
$ pilot        <chr> "Chewbacca", "Chewbacca", "Chewbacca", "Han Sol…
sw_flms_veh_plt_plnt |>
  inspectdf::inspect_types() |>
  inspectdf::show_plot(
    text_labels = TRUE, col_palette = 1)

sw_flms_veh_plt_plnt |> 
  inspectdf::inspect_cat() |> 
  inspectdf::show_plot(text_labels = TRUE, 
    col_palette = 1)

anti_join

The anti_join() keeps all rows in dataset x that do not have a match in dataset y.

Note the last join created empty values for vehicle, episode_id, release_date, and pilot:

sw_flms_veh_plt_plnt |> 
  dplyr::filter(is.na(episode_id)) |> 
  inspectdf::inspect_na() |> 
  inspectdf::show_plot(text_labels = TRUE, 
    col_palette = 1) +
  coord_flip()

The code below uses the anti_join() to join sw_flms_veh_plt_plnt to sw_films and create a dataset with only title and film_planet:

dplyr::anti_join(x = sw_flms_veh_plt_plnt, 
            y = sw_flms, 
            by = c("title", "episode_id", "release_date")) |> 
  dplyr::select(title, film_planet) |> 
  inspectdf::inspect_types() |>
  inspectdf::show_plot(text_labels = TRUE, col_palette = 1)

dplyr::anti_join(x = sw_flms_veh_plt_plnt, 
            y = sw_flms, 
            by = c("title", "episode_id", "release_date")) |> 
  inspectdf::inspect_na() |> 
  inspectdf::show_plot(
    text_labels = TRUE, col_palette = 1) + 
  ggplot2::coord_flip()

These should only contain the titles not in sw_films.